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1 Introduction 

Causal questions are fundamental in all parts of science. Answering such questions from 
observational data is notoriously difficult, but there has been a lot of recent interest and 
progress in this held. This paper gives a selective review of some of these results, intended 
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for researchers who are not familiar with graphical models and causality, and with a focus 
on methods that are applicable to large data sets. 

In order to clarify the problem formulation, we hrst discuss the difference between 
causal and non-causal questions, and between observational and experimental data. We 
then formulate the problem setting and give an overview of the rest of this paper. 

1.1 Causal versus non-causal research questions 

We use a small hypothetical example to illustrate the concepts. 

Example 1. Suppose that there is a new rehabilitation program for prisoners, aimed at 
lowering the recidivism rate. Among a random sample of 1500 prisoners, 500 participated 
in the program. All prisoners were followed for a period of two years after release from 
prison, and it was recorded whether or not they were rearrested within this period. Table 
shows the (hypothetical) data. We note that the rearrest rate among the participants of 
the program (20%) is significantly lower than the rearrest rate among the non-participants 
(50%). 

Table 1: Hypothetical data about a rehabilitation program for prisoners. 


Rearrested Not rearrested Rearrest rate 

Participants 100 400 20% 

Non-participants 500 500 50% 


We can ask various questions based on these data. For example: 

1. Can we predict whether a prisoner will be rearrested, based on participation in the 
program (and possibly other variables)? 

2. Does the program lower the rearrest rate? 

3. What would the rearrest rate be if the program were compulsory for all prisoners? 

Question 1 is non-causal, since it involves a “standard” prediction or classihcation 
problem. We note that this question can be very relevant in practice, for example in parole 
considerations. However, since we are interested in causality here, we will not consider 
questions of this type. 

Questions 2 and 3 are causal. Question 2 asks if the program is the cause of the 
lower rearrest rate among the participants. In other words, it asks about the mechanism 
behind the data. Question 3 asks a prediction of the rearrest rate after some novel outside 
intervention to the system, namely after making the program compulsory for all prisoners. 
In order to make such a prediction, one needs to understand the causal structure of the 
system. 
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Example 2. We consider gene expression levels of yeast cells. Suppose that we want to 
predict the average gene expression levels after knocking out one of the genes, or after 
knocking out multiple genes at a time. These are again causal guestions, since we want to 
make predictions after interventions to the system. 

Thus, causal questions are about the mechanism behind the data or about predictions 
after a novel intervention is applied to the system. They arise in all parts of science. 
Application areas involving big data include for example systems biology (e.g., 

EDI Ea sni [62]) , neuroscience (e.g., pEQisni [58]). climate science (e.g., [Ml [IT]), and 
marketing (e.g., !I|). 

1.2 Observational versus experimental data 

Going back to the prisoners example, which of the three posed questions can we answer? 
This depends on the origin of the data, and brings us to the distinction between observa¬ 
tional and experimental data. 

Observational data. Suppose hrst that participation in the program was voluntary. 
Then we would have so-called observational data, since the subjects (prisoners) chose their 
own treatment (rehabilitation program or not), while the researchers just observed the 
results. From observational data, we can easily answer question 1. It is difficult, however, 
to answer questions 2 and 3. 

Let us hrst consider question 2. Since the participants form a self-selected subgroup, 
there may be many differences between the participants and the non-participants. For 
example, the participants may be more motivated to change their lives, and this may 
contribute to the difference in rearrest rates. In this case, the effects of the program and 
the motivation of the prisoners are said to be mixed-up or confounded. 

Next, let us consider question 3. At hrst sight, one may think that the answer is simply 
20%, since this was the rearrest rate among the participants of the program. But again we 
have to keep in mind that the participants form a self-selected subgroup that is likely to 
have special characteristics. Hence, the rearrest rate of this subgroup cannot be extrapo¬ 
lated to the entire prisoners population. 

Experimental data. Now suppose that it was up to the researchers to decide which 
prisoners participated in the program. For example, suppose that the researchers rolled 
a die for each prisoner, and let him/her participate if the outcome was 1 or 2. Then we 
would have a so-called randomized controlled experiment and experimental data. 

Let us look again at question 2. Due to the randomization, the motivation level of 
the prisoners is likely to be similar in the two groups. Moreover, any other factors of 
importance (like social background, type of crime committed, number of earlier crimes, 
etcetera) are likely to be similar in the two groups. Hence, the groups are equal in all 
respects, except for participation in the program. The observed difference in rearrest rate 
must therefore be due to the program. This answers question 2. 
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Finally, the answer to question 3 is now 20%, since the randomized treatment assign¬ 
ment ensures that the participants form a representative sample of the population. 

Thus, causal questions are best answered by experimental data, and we should work 
with such data whenever possible. Experimental data is not always available, however, 
since randomized controlled experiments can be unethical, infeasible, time consuming or 
expensive. On the other hand, observational data is often relatively cheap and abundant. 
In this paper, we therefore consider the problem of answering causal questions about large- 
scale systems from observational data. 

1.3 Problem formulation 

It is relatively straightforward to make “standard” predictions based on observational data 
(see the “observational world” in Figure [^, or to estimate causal effects from random¬ 
ized controlled experiments (see the “experimental world” in Figure]^. But we want to 
estimate causal effects from observational data. This means that we need to move from 
the observational world to the experimental world. This step is fundamentally impossible 
without causal assumptions, even in the large sample limit with perfect knowledge about 
the observational distribution (cf. Section 2 of |13])- In other words, causal assumptions 
are needed to deduce the post-intervention distribution from the observational distribution. 
In this paper, we assume that the data were generated from a (known or unknown) causal 
structure which can be represented by a directed acyclic graph (DAG). 

Outline of this paper. In the next section, we assume that the data were generated 
from a known DAG. In particular, we discuss the framework of a structural equation model 
(SEM) and its corresponding causal DAG. We also discuss the estimation of causal effects 
under such a model. In large-scale networks, however, the causal DAG is often unknown. 
Next, we therefore discuss causal structure learning, that is, learning information about 
the causal structure from observational data. We then combine these two parts and discuss 
methods to estimate (bounds on) causal effects from observational data when the causal 
structure is unknown. We also illustrate this method on a yeast gene expression data set. 
We close by mentioning several extensions of the discussed work. 

2 Estimating causal effects when the causal structure 
is known 

Gausal structures can be represented by graphs, where the random variables are represented 
by nodes (or vertices), and causal relationships between the variables are represented by 
edges between the corresponding nodes. Such causal graphs have two important practical 
advantages. First, a causal graph provides a transparent and compact description of the 
causal assumptions that are being made. This allows these assumptions to be discussed 
and debated among researchers. Next, after agreeing on a causal graph, one can easily 
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Fignre 1: We want to estimate cansal effects from observational data. This means that we 
need to move from the observational world to the experimental world. This can only done 
by imposing cansal assnmptions. 


determine cansal effects. In particnlar, we can read off from the graph which sets of 
variables can or cannot be used for covariate adjustment in order to obtain a given causal 
effect. We refer to HI] for further details on the material in this section. 

2.1 Graph terminology 

We consider graphs with directed edges (—)■) and undirected edges (—). There can be at 
most one edge between any pair of distinct nodes. If all edges are directed (undirected), 
then the graph is called directed {undirected). A partially directed graph can contain both 
directed and undirected edges. The skeleton of a partially directed graph is the undirected 
graph that results from replacing all directed edges by undirected edges. 

Two nodes are adjacent if they are connected by an edge. If X —)■ F, then X is a 
parent of Y. The adjacency set and the parent set of a node X in a graph Q are denoted 
by adj{X,Q) and pa{X,Q), respectively. A graph is complete if every pair of nodes is 
adjacent. 

A path in a graph ^ is a distinct sequence of nodes, such that all successive pairs of 
nodes in the sequence are adjacent in Q. A directed path from X to F is a path between 
X and F in which all edges point towards F, i.e., X —)■■■■—)■ F. A directed path from X 
to F together with an edge F —)■ X forms a directed cycle. A directed graph is acyclic if 
it does not contain directed cycles. A directed acyclic graph is also called a DAG. 

A node X is a collider on a path if the path has two colliding arrows at X, that is, the 
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path contains —)■ X <(—. Otherwise X is a non-collider on the path. We emphasize that the 
collider status of a node is relative to a path; a node can be a collider on one path, while 
it is a non-collider on another. The collider X is unshielded if the neighbors of X on the 
path are not adjacent to each other in the graph, that is, the path contains W ^ X Z 
and W and Z are not adjacent in the graph. 

2.2 Structural equation model (SEM) 

We consider a collection of random variables Xi,... ,Xp that are generated by structural 
equations (see, e.g. [U1 lUU]): 

Xi^ giiSi^d), i = l,...,p, (1) 

where Sj C {Xi,..., Xp} \ {X*} and is some random noise. We interpret these equations 
causally, as describing how each Xj is generated from the variables in Sj and the noise 
Cj. Thus, changes to the variables in S* can lead to changes in Xj, but not the other 
way around. We use the notation in ([^ to emphasize this asymmetric relationship. 
Moreover, we assume that the structural equations are autonomous, in the sense that 
we can change one structural equation without affecting the others. This will allow the 
modelling of local interventions to the system. 

The structural equations correspond to a directed graph Q that is generated as follows: 
the nodes are given by Xi,..., Xp, and the edges are drawn so that S* is the parent set of 
Xi, i = 1,... ,p. The graph Q then describes the causal structure and is called the causal 
graph', the presence of an edge Xj —)■ Xj means that Xj is a potential direct cause of Xj 
(i.e., Xj may play a role in the generating mechanism of Xj), and the absence of an edge 
Xfc —)■ Xj means that X^ is dehnitely not a direct cause of Xj (i.e., X^ does not play a role 
in the generating mechanism of Xj). 

Throughout, we make several assumptions about the model. The graph Q is assumed 
to be acyclic (hence a DAG), and the error terms ei,... ,ep are jointly independent. In 
terms of the causal interpretation, these assumptions mean that we do not allow feedback 
loops nor unmeasured confounding variables. The above model with these assumptions 
was called a structural causal model by |12]. We will simply refer to it as a structural 
equation model (SEM). If all structural equations are linear, we will call it a linear SEM. 

We now discuss two important properties of SEMs, namely factorization and d-separation. 
If Xi,..., Xp are generated from a SEM with causal DAG Q, then the density /(xi,..., Xp) 
of Xi,... ,Xp (assuming it exists) factorizes as: 

p 

f{xi, ...,Xp) = Y[fi{xi\pa{xi,g)), (2) 

i=l 

where fi{xi\pa{xi,Q)) is the conditional density of Xj given pa{Xi,Q). 

If a density factorizes according to a DAG as in ([^, then one can use the DAG to read 
off conditional independencies that must hold in the distribution (regardless of the choice 
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of the /j(-)’s), using a graphical criterion called d-separation (see, e.g., Dehnition 1 in |13]). 
In particular, the so-called global Markov property implies that when two disjoint sets A 
and B of vertices are d-separated by a third disjoint set S, then A and B are conditionally 
independent given S (A _LL B|S) in any distribution that factorizes according to the DAG. 

Example 3. We consider the following structural equations and the corresponding causal 
DAG for the random variables P, S, R and M: 


P <(— gi{M, cp) 

S ^ g2{P, ^s) 

R^gsiM, S,eR) 
M <(— g^i^M) 


M 



P 


s- 


R 


where cp, cs, cp and cm are mutually independent with arbitrary mean zero distributions. 
For each structural equation, the variables on the right hand side appear in the causal DAG 
as the parents of the variable on the left hand side. 

We denote the random variables by M, P, S and R, since these structural equations 
can be used to describe a possible causal mechanism behind the prisoners data (Example 
[^, where M = measure of motivation, P = participation in the program (P = 1 means 
participation, P = 0 otherwise), S = measure of social skills taught by the program, and 
R = rearrest (R = 1 means rearrest, P = 0 otherwise). 

We see that the causal DAG of this SEM indeed provides a clear and compact description 
its causal assumptions. In particular, it allows that motivation directly affects participa¬ 
tion and rearrest. Moreover, it allows that participation directly affects social skills, and 
that social skills directly affect rearrest. The missing edge between M and S encodes the 
assumption that there is no direct effect from motivation on social skills. In other words, 
any effect of motivation on social skills goes entirely through participation (see the path 
M ^ P ^ S). Similarly, the missing edge between P and R encodes the assumption that 
there is no direct effect of participation on rearrest; any effect of participation on rearrest 
must fully go through social skills (see the path P ^ S ^ R). 

2.3 Post-intervention distributions and causal effects 

Now how does the framework of the SEM allow us to move between the observational and 
experimental worlds? This is straightforward, since an intervention at some variable Xi 
simply means that we change the generating mechanism of Xi, that is, we change the cor¬ 
responding structural equation g^f) (and leave the other structural equations unchanged). 
For example, one can let X^ ^ Ci where e* has some given distribution, or Xi t— x\ for some 
hxed value x\ in the support of Aj. The latter is often denoted as Pearl’s do-intervention 
do{Xi = x'i) and is interpreted as setting the variable X* to the value a:' by an outside 
intervention, uniformly over the entire population |l3]. 
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Example 4. In the prisoners example (see Examples 0 and the quantity P{R = 

l\do{P = 1)) represents the rearrest probability when all prisoners are forced to partici¬ 
pate in the program, while P{R = l\do{P = 0)) is the rearrest probability if no prisoner 
is allowed to participate in the program. We emphasize that these quantities are generally 
not equal to the usual conditional probabilities P{R = 1\P = 1) and P{R = 1\P = 0), 
which represent the rearrest probabilities among prisoners who choose to participate or not 
to participate in the program. 

In the gene expression example (see Example let Xi and Xj represent the gene 
expression level of genes i and j. Then E{Xj\do{Xi = x'fj) represents the average expression 
level of gene j after setting the gene expression level of gene i to the value x[ by an outside 
intervention. 


Truncated factorization formula. A do-intervention on Xi means that Xi no longer 
depends on its former parents in the DAG, so that the incoming edges into Xi can be 
removed. This leads to a so-called truncated DAG. The post-intervention distribution 
factorizes according to this truncated DAG, so that we get: 


/(xi,.. .,Xp\do{Xi = x'i)) 


Uj^i <3)) iixi = x'i, 

0 otherwise. 


(3) 


This is called the truncated factorization formula [Hj, the manipulation formula [32] or 
the g-formula |S2]- Note that this formula heavily uses the factorization formula ([^ and 
the “autonomy assumption” (see page . 


Defining the total effect. Summary measures of the post-intervention distribution can 
be used to define total causal effects. In the prisoners example, it is natural to define the 
total effect of P on i? as 


P(P = l\do{P = 1)) - P(P = l|do(P = 0)). 


Again, we emphasize that this is different from P(P = 1|P = 1) — P(P = 1|P = 0). 

In a setting with continuous variables, the total effect of X, on Y can be defined as 


A 

dxi 


E{Y\do{Xi 



Computing the total effect. A total effect can be computed using, for example, covariate 
adjustment |l3l|57], inverse probability weighting (IPW) [53l[23], or instrumental variables 
(e.g, [1]). In all these methods, the causal DAG plays an important role, since it tells 
us which variables can be used for covariate adjustment, which variables can be used as 
instruments, or which weights should be used in IPW. 

In this paper, we focus mostly on linear SEMs. In this setting, the total effect of Xi on 
Y can be easily computed via linear regression with covariate adjustment. If D G pa(Xi, Q) 



then the effect of Xi on Y equals zero. Otherwise, it equals the regression coefficient of 
Xi in the linear regression of Y on Xj and pa{Xi, Q) (see Proposition 3.1 of [39]). In other 
words, we simply regress Y on Xj while adjusting for the parents of Xj in the causal DAG. 
This is also called “adjusting for direct causes of the intervention variable”. 

Example 5. We consider the following linear SEM: 


X\ i — 2X4 T 61 
X2 ■<— 3X1 + 62 
X3 ■<— 2X2 + X4 + 63 
X4 i — 64 



The errors are mutually independent with arbitrary mean zero distributions. We note that 
the coefficients in the structural equations are depicted as edge weights in the causal DAG. 

Suppose we are interested in the total effect of Xi on X 3 . Then we consider an outside 
intervention that sets Xi to the value xi, i.e., do{Xi = xi). This means that we change the 
structural equation for Xi to Xi ^ xi. Since the other structural equations do not change, 
we then obtain X 2 = 3xi + 62 , X 4 = 64 and X 3 = 2 X 2 + X 4 + 63 = 6 x 1 + 2 e 2 + 64 + 63 . 
Hence, E^X^ldo^Xi = xi)) = 6 x 1 , and differentiating with respect to xi yields a total effect 
of 6. 

We note that the total effect of Xi on X 3 also equals the product of the edge weights 
along the directed path Xi —)■ X 2 —)■ X 3 . This is true in general for linear SEMs: the total 
effect of Xi on Y can be obtained by multiplying the edge weights along each directed path 
from Xi to Y, and then summing over the directed paths (if there is more than one). 

The total effect can also be obtained via regression. Since pa{Xi,Q) = {X 4 }, the total 
effect of Xi on X 3 equals the coefficient of Xi in the regression of X^ on Xi and X4. It 
can be easily verified that this again yields 6. One can also verify that adjusting for any 
other subset o/{X 2 ,X 4 } does not yield the correct total effect. 


3 Causal structure learning 


The material in the previous section can be used if the causal DAG is known. In settings 
with big data, however, it is rare that one can draw the causal DAG. In this section, we 
therefore consider methods for learning DAGs from observational data. Such methods are 
called causal structure learning methods. 


Recall from Section 2^ that DAGs encode conditional independencies via d-separation. 
Thus, by considering conditional independencies in the observational distribution, one may 
hope to reverse-engineer the causal DAG that generated the data. Unfortunately, this does 
not work in general, since the same set of d-separation relationships can be encoded by 
several DAGs. Such DAGs are called Markov equivalent and form a Markov equivalence 
class. 
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A Markov equivalence class can be described uniquely by a completed partially directed 
acyclic graph (CPDAG) [Sj [9]. The skeleton of the CPDAG is dehned as follows. Two 
nodes Xj and Xj are adjacent in the GPDAG if and only if, in any DAG in the Markov 
equivalence class, X, and Xj cannot be d-separated by any set of the remaining nodes. 
The orientation of the edges in the GPDAG is as follows. A directed edge X* —)■ Xj in the 
GPDAG means that the edge X* —)■ Xj occurs in all DAGs in the Markov equivalence class. 
An undirected edge Xj — Xj in the GPDAG means that there is a DAG in the Markov 
equivalence class with Xj —)■ Xj, as well as a DAG with Xj ^ Xj. 

It can happen that a distribution contains more conditional independence relationships 
than those that are encoded by the DAG via d-separation. If this is not the case, then the 
distribution is called faithful with respect to the DAG. If a distribution is both Markov 
and faithful with respect to a DAG, then the conditional independencies in the distribu¬ 
tion correspond exactly to d-separation relationships in the DAG, and the DAG is called 
a perfect map of the distribution. 

Problem setting. Throughout this section, we consider the following setting. We are 
given n i.i.d. observations of X, where X = (Xi,... ,Xp) is generated from a SEM. We 
assume that the corresponding causal DAG ^ is a perfect map of the distribution of X. 
We aim to learn the Markov equivalence class of Q. 

In the following three subsections we discuss so-called constraint-based, score-based 
and hybrid methods for this task. The discussed algorithms are available in the R-package 
pcalg [29]. In the last subsection we discuss a class of methods that can be used if one is 
willing to impose additional restrictions on the SEM that allow identihcation of the causal 
DAG (rather than its GPDAG). 

3.1 Constraint-based methods 

Gonstraint-based methods learn the GPDAG by exploiting conditional independence con¬ 
straints in the observational distribution. The most prominent example of such a method 
is probably the PG algorithm [60] . This algorithm Erst estimates the skeleton of the un¬ 
derlying GPDAG, and then determines the orientation of as many edges as possible. 

We discuss the estimation of the skeleton in more detail. Recall that, under the Markov 
and faithfulness assumptions, two nodes Xj and Xj are adjacent in the GPDAG if and 
only if they are conditionally dependent given all subsets of X \ {Xj,Xj}. Therefore, 
adjacency of Xj and Xj can be determined by testing Xj _LL X^jS for all possible subsets 
S C X \ {Xi,Xj}. This naive approach is used in the SGS algorithm [60|. It quickly 
becomes computationally infeasible for a large number of variables. 

The PG algorithm avoids this computational trap by using the following fact about 
DAGs: two nodes Xj and Xj in a DAG Q are d-separated by some subset of the remaining 
nodes if and only if they are d-separated by pa{Xi,Q) or by pa{Xj,Q). This fact may 
seem of little help at hrst, since we do not know pa{Xi,Q) and pa{Xj,Q) (then we would 
know the DAG!). It is helpful, however, since it allows a clever ordering of the conditional 
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independence tests in the PC algorithm, as follows. The algorithm starts with a complete 
undirected graph. It then assesses, for all pairs of variables, whether they are marginally 
independent. If a pair of variables is found to be independent, then the edge between them 
is removed. Next, for each pair of nodes (X*, Xj) that are still adjacent, it tests conditional 
independence of the corresponding random variables given all possible subsets of size 1 of 
adj{Xi,Q*) \ {Xj} and of adj(Xj,Q*) \ {Xj}, where Q* is the current graph. Again, it 
removes the edge if such a conditional independence is deemed to be true. The algorithm 
continues in this way, considering conditioning sets of increasing size, until the size of the 
conditioning sets is larger than the size of the adjacency sets of the nodes. 

This procedure gives the correct skeleton when using perfect conditional independence 
information. To see this, note that at any point in the procedure, the current graph is a 
supergraph of the skeleton of the CP DAG. By construction of the algorithm, this ensures 
that Xj _LL Xj\pa{Xi,Q) and Xj _LL Xj\pa{Xj,Q) were assessed. 

After applying certain edge orientation rules, the output of the PC algorithm is a 
partially directed graph, the estimated CPDAG. This output depends on the ordering of 
the variables (except in the limit of an inhnite sample size), since the ordering determines 
which conditional independence tests are done. This issue was studied in [H], where it 
was shown that the order-dependence can be very severe in high-dimensional settings with 
many variables and a small sample size (see Section 4.3 for a data example). Moreover, 
na proposed an order-independent version of the PC algorithm, called PC-stable. This 
version is now the default implementation in the R-package pcalg [29] . 

We note that the user has to specify a signihcance level a for the conditional indepen¬ 
dence tests. Due to multiple testing, this parameter does not play the role of an overall 
signihcance level. It should rather be viewed as a tuning parameter for the algorithm, 
where smaller values of a typically lead to sparser graphs. 

The PC and PC-stable algorithms are computationally feasible for sparse graphs with 
thousands of variables. Both PC and PC-stable were shown to be consistent in sparse high¬ 
dimensional settings, when the joint distribution is multivariate Gaussian and conditional 
independence is assessed by testing for zero partial correlation [2H1 HI]- By using Rank 
correlation, consistency can be achieved in sparse high-dimensional settings for a broader 
class of Gaussian copula or nonparanormal models m- 


3.2 Score-based methods 

Score-based methods learn the CPDAG by (greedily) searching for an optimally scoring 
DAG, where the score measures how well the data hts to the DAG, while penalizing the 
complexity of the DAG. 

A prominent example of such an algorithm is the greedy equivalence search (GES) 
algorithm ra. GES is a grow-shrink algorithm that consists of two phases: a forward 
phase and a backward phase. The forward phase starts with an initial estimate (often 
the empty graph) of the CPDAG, and sequentially adds single edges, each time choosing 
the edge addition that yields the maximum improvement of the score, until the score 
can no longer be improved. The backward phase starts with the output of the forward 
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phase, and sequentially deletes single edges, each time choosing the edge deletion that 
yields a maximum improvement of the score, until the score can no longer be improved. 
A computational advantage of GES over the traditional DAG-search methods is that it 
searches over the space of all possible GPDAGs, instead of over the space of all possible 
DAGs. 

The GES algorithm requires the scoring criterion to be score equivalent, meaning that 
every DAG in a Markov equivalence class gets the same score. Moreover, the choice 
of scoring criterion is crucial for computational and statistical performances. The so- 
called decomposability property of a scoring criterion allows fast updates of scores during 
the forward and the backward phase. For example, (penalized) log-likelihood scores are 
decomposable, since ([^ implies that the (penalized) log-likelihood score of a DAG can be 
computed by summing up the (local) scores of each node given its parents in the DAG. 
Finally, the so-called consistency property of a scoring criterion ensures that the true 
GPDAG gets the highest score with probability approaching one (as the sample size tends 
to inhnity). 

GES was shown to be consistent when the scoring criterion is score equivalent, decom¬ 
posable and consistent. For multivariate Gaussian or multinomial distributions, penalized 
likelihood scores such as BIG satisfy these assumptions. 

3.3 Hybrid methods 

Hybrid methods learn the GPDAG by combining the ideas of constraint-based and score- 
based methods. Typically, they hrst estimate (a supergraph of) the skeleton of the GPDAG 
using conditional independence tests, and then apply a search and score technique while 
restricting the set of allowed edges to the estimated skeleton. A prominent example is the 
Max-Min Hill-Glimbing (MMHG) algorithm [66] . 

The restriction on the search space of hybrid methods provides a huge computational 
advantage when the estimated skeleton is sparse. This is why the hybrid methods scale well 
to thousands of variables, whereas the unrestricted score-based methods do not. However, 
this comes at the cost of inconsistency or at least at the cost of a lack of consistency proofs. 
Interestingly, empirical results have shown that a restriction on the search space can also 
help to improve the estimation quality |66]. 

This gap between theory and practice was addressed in [38] , who proposed a consistent 
hybrid modification of GES, called ARGES. The search space of ARGES mainly depends 
on an estimated conditional independence graph. (This is an undirected graph containing 
an edge between Xt and Xj if and only if Xi J/L Xj\V \ {Xi,Xj}. It is a supergraph 
of the skeleton of the GPDAG.) But the search space also changes adaptively depending 
on the current state of the algorithm. This adaptive modihcation is necessary to achieve 
consistency in general. The fact that the modihcation is relatively minor may provide an 
explanation for the empirical success of (inconsistent) hybrid methods. 
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3.4 Learning SEMs with additional restrictions 

Now that we have looked at various different methods to estimate the CPDAG, we close 
this section by discussing a slightly different approach that allows estimation of the causal 
DAG rather than its GPDAG. Identification of the DAG can be achieved by imposing 
additional restrictions on the generating SEM. Examples of this approach include the 
LiNGAM method for linear SEMs with non-Gaussian noise [MIISS], methods for nonlinear 
SEMs [21] and methods for linear Gaussian SEMs with equal error variances [IB] , 

We discuss the LiNGAM method in some more detail. A linear SEM can be written 
as X = 5X + e or equivalently X = Ae with A = {I — B)~^. The LiNGAM algorithm of 
lai uses independent component analysis (IGA) to obtain estimates A and B = I — A ^ 
of A and B. Ideally, rows and columns of B can be permuted to obtain a lower triangular 
matrix and hence an estimate of the causal DAG. This is not possible in general in the 
presence of sampling errors, but a lower triangular matrix can be obtained by setting some 
small non-zero entries to zero and permuting rows and columns of B. 

A more recent implementation of the LiNGAM algorithm, called DirectLiNGAM was 
proposed by [55]. This implementation is not based on IGA. Rather, it estimates the 
variable ordering by iteratively finding an exogenous variable. DirectLiNGAM is suitable 
for settings with a larger number of variables. 


4 Estimating the size of causal effects when the causal 
structure is unknown 

We now combine the previous two sections and discuss methods to estimate bounds on 
causal effects from observational data when the causal structure is unknown. We first 
define the problem setting. 

Problem setting: We have n i.i.d. realizations of X, where X is generated from a linear 
SEM with Gaussian errors. We do not know the corresponding causal DAG, but we assume 
that it is a perfect map of the distribution of X. Our goal is to estimate the sizes of causal 
effects. 

We first discuss the IDA method [22] to estimate the effect of single interventions in 
this setting (for example a single gene knockout). Next, we consider a generalization of this 
approach for multiple simultaneous interventions, called jointIDA [29]. Finally, we present 
a data application from [221111]. 

4.1 IDA 

Suppose we want to estimate the total effect of Xi on a response variable Y. The conceptual 
idea of IDA is as follows. We first estimate the GPDAG of the underlying causal DAG, using 
for example the PG algorithm. Next, we can list all the DAGs in the Markov equivalence 
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For each parent set, estimate the 



Figure 2: Schematic representation of the IDA algorithm, taken from [39] , 


class described by the estimated CPDAG. Under our assumptions and in the large sample 
limit, one of these DAGs is the true causal DAG. We can then apply covariate adjustment 
for each DAG, yielding an estimated total effect of Xi on Y for each possible DAG. We 
collect all these effects in a multiset 0. Bounds on 0 are estimated bounds on the true 
causal effect. 

For large graphs, it is computationally intensive to list all the DAGs in the Markov 
equivalence class. However, since we can always use the parent set of Xi as adjustment 
set (see Section 2.3), it suffices to know the parent set of Xi for each of the DAGs in the 
Markov equivalence class, rather than the entire DAGs. These possible parent sets of Xi 
can be extracted easily from the GPDAG. It is then only left to count the number of DAGs 
in the Markov equivalence class with each of these parent sets. In [33] the authors used 
a shortcut, where they only looked whether a parent set is locally valid or not, instead of 
counting the number of DAGs in the Markov equivalence class. Here locally valid means 
that the parent set does not create a new unshielded collider with Xi as collider. This 
shortcut results in a set 0^ which contains the same distinct values as 0, but might have 
different multiplicities. Hence, if one is only interested in bounds on causal effects, the 
information in 0^ is sufficient. In other cases, however, the information on multiplicities 
might be important, for example if one is interested in the direction of the total effect 
(0 = {1,1,1,1,1, —1} would make us guess the effect is positive, while = {1, —1} loses 
this information). 

IDA was shown to be consistent in sparse high-dimensional settings. 


4.2 JointIDA 

We can also estimate the effect of multiple simultaneous or joint interventions. For example, 
we may want to predict the effect of a double or triple gene knockout. 

Generalizing IDA to this setting poses several non-trivial challenges. First, even if the 
parent sets of the intervention sets are known, it is non-trivial to estimate the size of a total 
joint effect, since a straightforward adjusted regression no longer works. Available methods 
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for this purpose are IPW [53] and the recently developed methods RRC [39] and MCD [39] . 
Under our assumptions, RRC recursively computes joint effects from single intervention 
effects, and MCD produces an estimate of the covariance matrix of the interventional 
distribution by iteratively modifying Cholesky decompositions of covariance matrices. 

Second, we must extract possible parent sets for the intervention nodes from the esti¬ 
mated CPDAG. The local method of IDA can no longer be used for this purpose, since 
some combinations of locally valid parent sets of the intervention nodes may not yield a 
“jointly valid” combination of parent sets. In [SH] the authors proposed a semi-local algo¬ 
rithm for obtaining jointly valid parent sets from a CPDAG. The runtime of this semi-local 
algorithm is comparable to the runtime of the local algorithm in sparse settings. Moreover, 
the semi-local algorithm has the advantage that it (asymptotically) produces a multiset 
of joint intervention effects with correct multiplicities (up to a constant factor). It can 
therefore also be used in IDA if the multiplicity information is important. 

JointIDA based on RRG or MGD was shown to be consistent in sparse high-dimensional 
settings. 


4.3 Application 


The IDA method is based on various assumptions, including multivariate Gaussianity, 
faithfulness, no hidden variables, and no feedback loops. In practice, some of these as¬ 
sumptions are typically violated. It is therefore very important to see how the method 
performs on real data. 

Validations were conducted in [32] on the yeast gene expression compendium of [26] , 
and in [^2] on gene expression data of Arabidopsis Thaliana. JointIDA was validated in 
[39] on the DREAM4 in silico network challenge [3l]. We refer to these papers for details. 

In the remainder, we want to highlight the severity of the order-dependence of the PG 
algorithm in high-dimensional settings (see Section 3.1), and also advocate the use of sub- 
sampling methods. We will discuss these issues in the context of the yeast gene expression 
data of pn] • These data contain both observational and experimental data, obtained under 


similar conditions. We focus here on the observational data, which contain gene expression 
levels of 5361 genes for 63 wild-type yeast organisms. 

Let us hrst consider the order-dependence. The ordering of the columns in our 63 x 5361 
observational data matrix should be irrelevant for our problem. But permuting the order 
of the columns (genes) dramatically changed the estimated skeleton. This is visualized 
in Figure [^ for 25 random orderings. Each estimated skeleton contained roughly 5000 
edges. Only about 2000 of those were stable, in the sense that they occurred in almost 
all estimated skeletons. We see that PG-stable (in red) selected the more stable edges. 
Perhaps surprisingly, it did this via a small modihcation of the algorithm (and not by 
actually estimating skeletons for many different variable orderings). 

Next, we consider adding sub-sampling. Figurej^shows ROG curves for various versions 
of IDA. In particular, there are three versions of PG: PG, PG-stable and MPG-stable. Here 
PG-stable yields an order-independent skeleton, and MPG-stable also stabilizes the edge 
orientations. For each version of IDA, one can add stability selection (SS) or stability 
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5000 10000 15000 

edges 


(a) Black entries indicate edges occurring in 
the estimated skeletons using the PC algorithm, 
where each row in the figure corresponds to a 
different random variable ordering. The original 
ordering is shown as variable ordering 26. The 
edges along the x-axis are ordered from edges 
that occur in the estimated skeletons for all or¬ 
derings, to edges that only occur in the skeleton 
for one of the orderings. Red entries denote edges 
in the uniquely estimated skeleton using the PC- 
stable algorithm over the same 26 variable order¬ 
ings (shown as variable ordering 27). 



(b) The step function shows the proportion of 
the 26 variable orderings in which the edges were 
present for the original PC algorithm, where the 
edges are ordered as in Figure [3(a^ The red bars 
show the edges present in the estimated skeleton 
using the PC-stable algorithm. 


Figure 3: Analysis of estimated skeletons of the CPDAGs for the yeast gene expression 
data m, using the PC and PC-stable algorithms with tuning parameter a = 0.01. The 
PC-stable algorithm yields an order-independent skeleton that roughly captures the edges 
that were stable among the different variable orderings for the original PC algorithm. 
Taken from [IT] . 


selection where the variable ordering is permuted in each sub-sample (SSP). We note that 
adding SSP yields an approximately order-independent algorithm. The best choice in this 
setting seems PC-stable -b SSP. 


5 Extensions 

There are various extensions of the methods described in the previous sections. We only 
mention some directions here. 


Local causal structure learning. 


Recall from Section 2.3 that we can determine the 
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Figure 4: Analysis of the yeast gene expression data [26] with PC (black lines), PC-stable 
(red lines), and MPC-stable (blue lines), using the original ordering over the variables 
(solid lines), using 100 runs stability selection without permuting the variable orderings 
(dashed lines, labelled with SS”), and using 100 runs stability selection with permuting 
the variable orderings (dotted lines, labelled with “-1- SSP”). The grey line labelled as 
“RG” represents random guessing. Taken from na. 


total effect of Xi on Y by adjusting for the direct causes, that is, by adjusting for the 
parents of Xi in the causal graph. Hence, if one is interested in a specihc intervention 
variable Xj, it is not necessary to learn the entire CPDAG. Instead, one can try to learn 
the local structure around X^. Algorithms for this purpose include, e.g., [651 HU mig. 


Causal structure learning in the presence of hidden variables and feedback 

loops. Maximal ancestral graphs (MAGs) can represent conditional independence infor¬ 
mation and causal relationships in DAGs that include unmeasured (hidden) variables [50] . 
Partial ancestral graphs (PAGs) describe a Markov equivalence class of MAGs. PAGs can 
be learned from observational data. A prominent algorithm for this purpose is the FGI 
algorithm, an adaptation of the PG algorithm [53 EB EDI ED]- Adaptations of FGI that 
are computationally more efficient include RFGI and FGI-I- [ig E3]. High- dimensional 
consistency of FGI and RFGI was shown by ng. The order-dependence issues studied in 
[14] (see Section 3.1) apply to all these algorithms, and order-independent versions can 
be easily derived. The algorithms FGI, RFGI and FGI-I- are available in the R-package 
pcalg [2D]- There is also an adaptation of LiNGAM that allows for hidden variables [25] . 
Gausal structure learning methods that allow for feedback loops can be found in [5ni37] l36]- 


Time series data. Time series data are suitable for causal inference, since the time 
component contains important causal information. There are adaptations of the PG and 
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FCI algorithms for time series data [TTl [T8l fTH] . These are computationally intensive when 
considering several time lags, since they replicate variables for the different time lags. 

Another approach for discrete time series data consists of modelling the system as a 
structural vector autoregressive (SVAR) model. One can then use a two-step approach, 
hrst estimating the vector autoregressive (VAR) model and its residuals, and then applying 
a causal structure learning method to the residuals to learn the contemporaneous causal 
structure. This approach is for example used in [27] . 

Finally, [7] proposed an approach based on Bayesian time series models, applicable to 
large scale systems. 

Causal structure learning from heterogeneous data. There is interesting work on 
causal structure learning from heterogeneous data. For example, one can consider a mix 
of observational and various experimental data sets ESI ST], or different data sets with 
overlapping sets of variables |HS1 El], or a combination of both (HTj. A related line of work 
is concerned with transportability of causal effects |5]. 

Covariate adjustment. Given a DAG and a set of intervention variables X and a set of 
target variables Y, Pearl’s backdoor criterion is a sufficient graphical criterion to determine 
whether a certain set of variables can be used for adjustment to compute the effect of X on 
Y. This result was strengthened by [56] who provided necessary and sufficient conditions. 
In turn, this result was generalized by [6H] who provided necessary and sufficient condi¬ 
tions for adjustment given a MAG. Pearl’s backdoor criterion was generalized to GPDAGs, 
MAGs and PAGs by [3T]. Finally, [15| provided necessary and sufficient conditions for ad¬ 
justment in DAGs, MAGs, GPDAGs and PAGs. 

Measures of uncertainty. The estimates of IDA come without a measure of uncertainty. 
(The regression estimates in IDA do produce standard errors, but these assume that the 
estimated GPDAG was correct. Hence, they underestimate the true uncertainty.) Asymp¬ 
totically valid conhdence intervals could be obtained using sample splitting methods (cf. 
[35]). but their performance is not satisfactory for small samples. Another approach that 
provides a measure of uncertainty for the presence of direct effects is given by HU- More 
work towards quantifying uncertainty would be highly desirable. 

6 Summary 

In this paper, we discussed the estimation of causal effects from observational data. This 
problem is relevant in many helds of science, since understanding cause-effect relationships 
is fundamental and randomized controlled experiments are not always possible. There is a 
lot of recent progress in this held. We have tried to give an overview of some of the theory 
behind selected methods, as well as some pointers to further literature. 

Finally, we want to emphasize that the estimation of causal effects based on observa¬ 
tional data cannot replace randomized controlled experiments. Ideally, such predictions 
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from observational data are followed up by validation experiments. In this sense, such 
predictions could help in the design of experiments, by prioritizing experiments that are 
likely to show a large effect. 
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